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Abstract 

C^) The M-tree is a fundamental tool in computer science. Among other applications, the application of M-tree search (by the tree 
C^l method) to the fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity 
Q pf these problems is reduced from 0(N 2 ) for a brute force method to 0(N log N) for the tree method, where N is the number of 
D particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We 
present a detailed description of how we have implemented the tree method on a Cypress GPU. An optimization that we found 
important is localized particle ordering to effectively utilize cache memory. We present a number of test results and performance 
(~~ \ measurements. Our results show that the execution of the tree traversal in a force calculation on a GPU is practical and efficient. 
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1. Introduction 

A technique for gravitational many-body simulations is a 
fundamental tool in astrophysical simulations because the grav- 
itational force drives structure formation in the universe. The 
length scales that arise in structure formation range from less 
than 1 cm for the aggregation of dust to more than 10 24 cm for 
the formation of cosmological structures. At all scales, grav- 
ity is a key physical process for the understanding of structure 
formation. The reason behind this is the long-range nature of 
gravity. 

Suppose we simulate structure formation with N particles. 
The flow of the many-body simulation is as follows. First, we 
calculate the mutual gravitational forces between the N parti- 
cles, then integrate the orbits for the N particles, and repeat 
this process as necessary. Although it is simple in principle, 
the force calculation is a challenging task from the point of 
view of computer science. A simple, exact method for the force 
calculation requires 0(N 2 ) computational complexity, which is 
prohibitively compute-intensive for large N. An exact force 
calculation is necessary in some types of simulations, such as 
few-body problems, the numerical integration of planets orbit- 
ing around a star (e.g., the Solar System), and the evolution of 
dense star clusters. For simulations that do not require exact 
forces, ho wever, several approximation techniques have been 
proposed dHocknev & Eastwood , Il98lt iBarnes & Hud. Il98d 
Greengard & Rokhlinl Il987). The particle-mesh /particle- 



particle-mesh method ( Hockney & E astwood 11981 *) and the 
tree method (Barn es & Hull 19861) reduce the computational 
complexity of the force calculation to 0(N log N). The 
fast multipole method (FMM ) reduces it further to O(N) 
dGreengard & Rokhlinl 119871) . Of these methods, the tree 
method has been used extensively in astrophysical simulations, 



since its adaptive nature is e ssential for dealing w i th clu mpy 
structure in the universe (e.g. jBouchet & Hernquist , 1988 ). 



Despite the 0(N log N) complexity, computational optimiza- 
tion of the tree method by techniques such as vectoriza- 
tion and parallelization is necessary to acco mmodate demand s 
for simulation s wit h larger and la rger N. IHernquist (1990), 
lMakinoldl990h . and Barnes (1990) have reported various tech- 
niques to vectorize t he force calculat ion with the tree method . 
IWarren et al.1 dl992hJDubinskl d 1996b . and lYahagi et~aildl999b 
have reported a parallel tree method for ma ssively parallel 
proce ssors (MPPs). In a recent publication (Spri ngel et all 



2005), a simulation of large-scale structure formation in the 



universe with more than ten billion particles, using a paral- 
lel tree code running on an MPP, has been reported. Another 
computational technique to speed up t he tree method utilizes 
the GRAPE special-pu rpose computer dSugimoto et al. , 1990; 



Mak ino & Taiii . 1998 ). Using a combination of vectorization 



techniques for the tree method, t he tree method can be executed 
efficiently on a GRAPE system (Makino[ fl99lh . 



Cosmological simulations are a "grand challenge" 
lem. The Gordon Bell prizes have be en awarded 
times for cosmological simulations ( Warren & Salmon, 
1 992; iFukushige & Makinol 1 1 996k IWarren et all 1 19971 1 1998 



prob- 
many 



Kawaietall 1 19991: lHamada et all l2009b. In thos e 



ulations, both parallel tr ee codes dWarren & Salmo n, 1992; 
Warren et all 1 19971.1 19981) and a tree code running on a GR APE 



system dFukushige & Makino , 19961: Kawai et a . . 1999b and 



a graphics processing unit (GPU) dHamada et al. 



2009) 



were 



used to perform cosmological simulations. 

In the present paper, we describe our implementation of the 
tree method on a GPU. The rise of the GPU forces us to re- 
think our way of doing parallel computing, since the perfor- 
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mance of recent GPUs has reached the impressive level of > 1 
Tflops. Acceleration techniques for many-bod y simulations 



with a GPU have already b e en reported (e.g., |N yland et al 



2007; P ortegies Zwart et all l2007t lHamada & Iitakal 12007 . 
Bell eman et al. . 20081) : however, these techniques have imple- 



mented an exact, brute force method with 0(N 2 ) complexity. It 
is apparent, however, that for applications that do not require 
exact forces, it is possible to do much more efficient compu- 
tation by the tree method. We have directly implemented the 
tree method on a GPU so that we can enjoy the speed of an 
0(N log N) algorithm. For small N < 30,000, the brute force 
method on a GPU is faster than the tree method owing to extra 
work concerning the tree data structure. However, our results 
show that the tree method significantly outperforms the brute 
force method on a GPU for N » 10,000, which is the stan- 
dard size in current astrophysical simulations is. Our code is 
simple and easy to extend to other numerical algorithms that 
require a neighbor list or a short-range force, such as algo- 
rifhms for the smoothed particle hydrodyn amics (SPH) method 
dGingold & Monaghanj[l977l:lLucvill977h . 



2. GPU architecture 

In this section, we briefly summarize the architecture of the 
Cypress GPU that we us ed in the present work (most of the 



Cypress UfU mat we used in tne pre sent \ 
information is taken from AMD Inch 2010)) 



2.1. Cypress architecture 

The Cypress GPU, from AMD, is the company's latest GPU 
and has many enhancements for general-purpose computing. It 
has 1600 arithmetic units in total. Each arithmetic unit is capa- 
ble of executing single-precision floating-point fused multiply- 
add (FMA) operation. Five arithmetic units make up a five- 
way very-long-instruction-word (VLIW) unit called a stream 
core (SC). Therefore, one Cypress processor has 320 SCs. One 
SC can execute a several combinations of operations such as 
(1) five 32-bit integer operations, (2) five single-precision FMA 
operations, (3) four single-precision FMA operations with one 
transcendental operation, (4) two double-precision add oper- 
ations, or (5) one double-precision FMA operations. Each 
SC has a register file of 1024 words, where one word is 128 
bits long (four single-precision words or two double-precision 
words). A group of 16 SCs make up a unit called a compute 
unit. At the top level of the GPU, there are 20 compute units, a 
controller unit called an ultra-threaded dispatch processor, and 
other units such as units for graphics processing, memory con- 
trollers, and DMA engines. 

All SCs in the compute unit work in a single-instruction- 
multiple-thread (SIMT) mode, i.e., 16 SCs execute the same 
instructions for four clock cycles to accommodate the latency 
of the arithmetic units. That is, we have 64 threads proceed- 
ing as a wavefront on the Cypress GPU. At the time of writ- 
ing, the fastest Cypress processor runs at 850 MHz and offers 
a peak performance of 1600 x 2 x 850 x 10 6 = 2.72 Tflop/s in 
single-precision operations. With double-precision operations, 
we have 320 x 2 x 850 x 10 6 = 544 Gflop/s. 



The external memory attached to the Cypress consists of 1 
GB of GDDR5 memory with a 256 bit bus. It has a data clock 
rate of 4.8 GHz and offers a bandwidth of 153.6 GB/s. This 
external memory is accessed through four banks, as shown in 
Figure Q] In each bank, there is a second-level read cache (L2 
cache). The total size of the second-level cache is 512 KB, i.e., 
4 x 128 KB. Twenty compute units and memory controllers are 
interconnected through a crossbar. Each compute unit has a 
first-level read cache (LI cache) and a local data share (LDS), 
as depicted in FigureQ] The sizes of the LI cache and LDS are 
8 KB and 32 KB, respectively. The LI cache can fetch data at 
54.4 GB/s when the cache is hit; namely, the aggregate band- 
width of the LI cache on the Cypress GPU is 54.4 GB/s X 20 
= 1.088 TB/s. This high memory bandwidth is a notable fea- 
ture of this GPU. As we shall describe in the following section, 
taking advantage of the hardware-managed cache is critical to 
obtaining high performance on the Cypress GPU. 

2.2. Programming the Cypress GPU 

In the present work, we programmed the Cypress GPU using 
an assembly-like language called IL (Intermediate Language). 
IL is like a virtual instruction set for GPUs from AMD. With 
IL, we have full control of every VLIW instruction. The pro- 
gramming model supported by IL is a single-instruction-and- 
multiple-data (SIMD) model at the level of the SC. In this pro- 
gramming model, a sequence of instructions generated from an 
IL program is executed on all SCs simultaneously with different 
input data. 

A block of code written in IL is called a compute kernel. The 
device driver for a GPU compiles IL instructions into the cor- 
responding machine code when we load a kernel written in IL. 
In a compute kernel, we explicitly declare what type of vari- 
able the input data is. In the main body of the IL code, we 
write arithmetic operations on the input data. Logically, each 
SC is implicitly assigned data that is different from that for ev- 
ery other SC. In the case of a simple compute kernel, the SC 
operates on the assigned data. Operations such as this, as arise 
in pure stream computing, seem to work with the highest effi- 
ciency. In a complex compute kernel, which we explore in the 
present work, each SC not only operates on the assigned data 
but also explicitly loads random data that might be assigned to 
another SC. To accomplish a random access to external mem- 
ory, we explicitly calculate the address of the data in the com- 
pute kernel. 

The ATI Stream software development kit (SDK) for the Cy- 
press GPU also supports OpenCL, which is a standard API with 
an extended C language (hereafter referred to as C for OpenCL) 
for writing a compute kernel. In this work, we also present 



a compute kernel written in C for OpenCL (see Appendix A 
for the code). We believe that it is instructive to present our 
algorithm in C for OpenCL and that this makes the algorithm 
easy to understand. Both programming methods (using IL and 
using C for OpenCL) have pros and cons. With IL, we have 
the advantage of full control of the VLIW instructions, but a 
compute kernel written in IL is somewhat cumbersome. On 
the other hand, it is is easier to start writing a compute ker- 
nel in C for OpenCL, but optimization for any particular GPU 
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Figure 1 : Block diagram of the Cypress GPU, with emphasis on the memory system. 



architecture is not straightforward. An advantage of program- 
ming with OpenCL is that we can use OpenCL to program a 
general-purpose many-core CPU. In the following section, we 
compare implementations of the tree method on a GPU based 
on compute kernels written in IL and in C for OpenCL. We also 
compare the performance of a compute kernel written in C for 
OpenCL on a GPU and on a CPU. 



3. Bare performance of brute force method on a GPU 

So far, we have developed around a dozen kernels in IL that 
we use for astrophysical many-body simulations. In this sec- 
tion, we report the performance of our implementation of a 
brute force method for computing gravitational forces. This 
code served as a basis for us to implement a more sophisticated 
algorithm later. 

To be precise, we have implemented a set of conventional 
equations expressed as 



Pi 



Yj P(Xi,Xj,mj)= ^ 



N 



N 



:\ Xi ~Xj\ 2 + e 2 )V 2 ' 



where a -, and p-, are the force vector and potential for a particle 
i, and Xu mi, e are the position of the particle, its mass, and a 
parameter that prevents division by zero, respectively. We can 
solve these equations by two nested loops on a general-purpose 
CPU. In the inner loop, we simultaneously evaluate the func- 
tions p and /, and require 22 arithmetic operations, which in- 
clude one square root and one division, to compute the interac- 
tion between particles i a nd j. Since previous authors, starting 
from IWarren et al. (1997), have used a conventional operation 
count for the evaluation of /, and we have adopted a con- 
ventional count of 38 throughout this paper. 

Elsen et al. reported an implementation of a brute 

force method for gravitational and other forces on an old 
GPU from AMD/ATi. One of the main insights obtained 
was that a loop-unrolling technique greatly enhanced the per- 
formance of the code. We have followed Elsen et al.'s ap- 
proach and tried several di fferent methods of loop unrolling. 



(1) 



Fuiiwara & Nakasato (2009) have reported our optimization ef- 
forts for old GPUs. Here, we present a summary of our results. 
In Figure [2] we plot the computing speed of our optimized IL 
kernel for computing Eq.dTJ as a function of N. We tested 
three GPU boards, namely RV770 GPUs running at 625 and 
750 MHz and a Cypress GPU running at 850 MHz. The three 
systems had peak computing speeds in single precision of 1.04, 
1.2, and 2.72 Tflop/s, respectively. So far, we have obtained a 
maximum performance of ~ 2.6 Tflop/s on the Cypress GPU 
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Figure 2: Performance of the brute force method on various GPUs 

for N > 150, 000. For N = 195, 584, our optimized brute force 
method took roughly 0.5 s on the Cypress GPU. As far as we 
know, the performance that we obtained is the fastest ever with 
one GPU chip. 

Even with the massive computing power available on such 
GPUs, however, we cannot escape from a computational com- 
plexity of 0(N 2 ). Therefore, if we need to do an astrophysical 
many-body simulation for large N, we need a smart algorithm 
to do the job, since the recent standard for N in astrophysical 
simulations is at least 100,000 for complex simulations with 
baryon physics and 1,000,000 for simple many -body simula- 
tions. 

4. Tree method on a GPU 

4.1. Tree method 

The tree method (IBarnes & Hut . Il986h is a special case of 
the general M-tree algorithm. This method has been optimized 
to efficiently calculate the mutual forces between particles, and 
reduces the computational complexity of the force calculation 
from 0(N 2 ) for the brute force method to 0(N log N). A trick 
used is that instead of computing the exact force by a brute 
force method, it approximates the force from distant particles 
using a multipole expansion. It is apparent that there is a trade- 
off between the approximation error and the way in which we 
replace a group of distant particles by a multipole expansion. 
A tree structure that contains all particles is used to judge this 
trade-off efficiently. 

The force calculation in the tree method is executed in two 
steps: (1) a tree construction and (2) the force calculation. In 
the tree construction, we divide a cube that encloses all of the 
particles into eight equal sub-cells. The first cell is the root of a 
tree that we construct; it is called the root cell. Then, each sub- 
cell is recursively subdivided in the same say until each cell 
contains zero or one particle. As the result of this procedure, 
we obtain an oct-tree. 

In the force calculation, we traverse the tree to judge whether 
we should replace a distant cell that contains a group of particles 



procedure treewalk(i, cell) 
if cell has only one particle 

force += f(i, cell) 
else 

if cell is far enough from i 

force += f _multipole (i , cell) 
else 

for i = 0, 7 

if cell->subcell [i] exists 

treewalk(i, cell->subcell [i] ) 

Figure 3: Pseudocode for the force calculation by traversing the tree 

that are geometrically close together with the multipole expan- 
sion of those particles. If we do not replace the cell, we then 
traverse sub-cells of the distant cell. If we do replace the cell, 
we calculate a particle-cell interaction. When we encounter a 
particle, we immediately calculate a particle-particle interac- 
tion. Given a particle with index i which we want to com- 
pute the force acting on, this procedure is expressed as pseu- 
docode in Figure [3] Note that subcell [] is a pointers to its 
own sub-cells. In this pseudocode, f is a function that com- 
putes the particle-particle interaction, and f _multipole is a 
function that computes the particle-cell interaction. In the work 
described in this paper, since we considered only the monopole 
moment of a cell, both functions were expressed exactly as in 
Eq. ([TJ. In principle, we can use any high-order moment in the 
particle-cell interaction. 

We follow this procedure starting from the root cell, with 
the following condition that tests whether a cell is far enough 
away. Let the distance between the particle and the cell be d. 
The cell is well separated from the particle if l/d < 6, where 
/ is the size of the cell and 6 is a parameter that controls the 
trade-off. Since the smaller l/d is, the more distant the cell is 
from the particle, this condition (called the opening condition) 
tests geometrically whether the cell is far from the particle. This 
recursive force-calculati on procedure i s almo st the same as in 
the original algorithm of Bar nes & Hut (fl986h . 

An important feature of the tree method is that with tree 
traversal, the force calculations for different particles are com- 
pletely independent of each other. Therefore, after we have 
completed the tree construction, the force calculation is a mas- 
sively parallel problem. There are two possible ways to im- 
plement the tree method on a GPU to take advantage of this 
feature. 

4.2. Tree method with GRAPE 

One way is a method proposed by Makinol (1991). This 
method was proposed as a tree method for the special-purpose 
computer GRAPE. A GRAPE system consists of a host com- 
puter and a GRAPE board or boards. The host computer con- 
trols the GRAPE board. For a program running on the host, the 
GRAPE board acts like a subroutine that calculates the gravita- 
tional forces for given particles. 

So, we need the following two steps to use a GRAPE system 
for a force calculation using the tree method: (1) construction of 



4 



an interaction list on the host computer, and (2) the actual force 
calculation on the GRAPE board. The interaction list is a list 
of particles and distant cells that are supposed to interact with a 
given particle. After the construction of interaction lists for all 
particles is completed, we compute the force on each particle 
by sending the interaction lists to the GRAPE board. These two 
steps are necessary because the GRAPE board does not have the 
ability to traverse the tree. Many authors have used this method 
extensively. Three winners and a finalist of the Gordon Bell 
prize have used a variant of this m ethod with a different version 
of the GRAPE system and a GPU JFukushige & Makinolll99e 



Kawai et all 119991: lHamada et all 120091 iKawai & Fukushigel 
2006h . A drawback of this approach is that the performance is 



limited by the speed of the host computer that is responsible 
for the tree traversal. This possible bottleneck, which is sim- 
ilar to Amdahl's law, might be critical without a highly tuned 
implementation of treewalkO running on the host. Further- 
more, i n all of the results p resented by Fukushige & Makinol 
d 19961). iKawai etalJ dl999h . IKawai & Fukushigil2006|), and 



Hamada et al.l (12009b . extra force evaluations by a factor of two 
were required to obtain the best performance. Note that be- 
cause of the extra force evaluations, the maximum error in the 
force that these authors have reported was better than the error 
obtained with the conventional tree method for a given 9. 

4.3. General tree walk 

Another way to implement the tree method, which we have 
followed in the present work, is to implement the whole pro- 
cedure shown in Figure Hon a GPU. The advantage of this 
approach is that only the tree construction, which requires rel- 
atively little time, is executed on the host, so that we utilize 
the massive computing power of the GPU as much as possi- 
ble. More importantly, we can use our met hod in applications 
that r equire short-range interaction forces (Wa rren & Salmon!. 
fl995h - This is because it is possible to implement a neigh- 
bor search algorithm as a general tree-walk procedure of the 
kind shown in Figure [4] Two procedures, proc_particle 
and proc_cell, are used to process the particle-particle and 
particle-cell interactions, respectively. In addition, a function 
distance_test is used to control the treatment of a distant 
cell. The calculation of the gravitational force is an application 
of the general tree-walk procedure that has been very success- 
ful. 

4.4. Our GPU implementation 

In our implementation of the tree method on a Cypress GPU, 
we first construct an tree on the host computer that controls the 
GPU. At this stage, there is no difference between our original 
tree code and the newly developed code for the GPU. 

We need to take special care in implementing the tree-walk 
procedure on the GPU. Currently, GPU architecture does not 
support recursive procedures except when it is possible to fully 
expand a recursion. Such a full expansion is possible only if 
the level of the recursion is fixed, but in the tree method, it is 
impossible to know how deep the recursion will be without per- 
forming the tree traversal. So, we adopted a method proposed 



procedure general_treewalk(i , cell) 
if cell has only one particle 

proc_particle (i , cell) 
else 

if distance_test (i , cell) is true 

proc_cell(i, cell) 
else 

for i = 0, 7 

if cell->subcell [i] exists 

general_treewalk(i , cell->subcell [i] ) 

Figure 4: Pseudocode for a general tree-walk procedure. 

procedure treewalk_iterative (i) 
cell = the root cell 
while cell is not null 

if cell has only one particle 
force += f(i, cell) 
cell = cell->next 
else 

if cell is far enough from i 
force += f _multipole (i , cell) 
cell = cell->next 

else 

cell = cell->more 

Figure 6: Pseudocode for an iterative tree-walk procedure. 



bv lMakino (1990) that transforms a recursion in treewalkO 
into an iteration. A key feature is that for a given cell, we do 
not need whole pointers (subcell [] ) to traverse the tree. We 
need only two pointers, to the cells that we will visit next when 
the opening condition is true and when it is false, respectively. 
These two pointers (hereafter called next [] and more [] ) are 
easily obtained by a breadth-first traversal of the tree. Figure [5] 
shows next [] and more [] schematically. Note that a cell that 
has sub-cells has both a next [] and a more [] pointer, while 
a leaf cell (a particle in the present case) with no sub-cells has 
only a next [] pointer. An iterative form of treewalkO with 
these two pointers is shown in Figure [6] 

We implemented the iterative procedure treewalkO rather 
directly in IL. The input data for this compute kernel is four ar- 
rays. The first contains the positions and masses of the particles 
and cells. We pack a position and a mass into a vector variable 
with four elements. Therefore, this array is an array of four- 
element vectors. The mass of a cell equals the total mass of the 
particles in the cell, and the position of the cell is at the center 
of mass of the particles. The second and third arrays contain 
the "next" and "more" pointers, respectively. Both of these are 
simple arrays. The fourth array contains the sizes of the cells. 
The size of the cell is necessary for testing the opening con- 
dition. See the description in Appendix Appendix A for the 
definitions of these arrays. 

In the present work, we adopted the following modified 
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Figure 5: A tree with "more" and "next" pointers, shown by blue and red arrows, respectively. 



opening condition, expressed as 



Table 1 : Our test system 



- + S < d, 



(2) 



where s is the distance between the center of the cell and the 
center of mass of the cell. The modified condition of Eq. (f2]l 
takes the particle distribution in the cell into account through s 
since if particles gather at a corner of a cell, the effective size 
of the cell becomes larger. In Figure [7] we present a schematic 
view of a distant cell and a particle which we are trying to calcu- 
late the force acting on. In practice, we precomputed the square 
of the effective size S effective as 



/ 

>J effective — T + * 



(3) 



and sent S effective instead of I for each cell. With S effective, we 
do not need to compute the square root of d, and we simply 
compare S effective and d 2 during the tree traversal. 

In Figure [8] we present an abstract version of our compute 
kernel written in IL. In IL programming, each SC executes the 
compute kernel with the assigned data in parallel. In this code, 
own represents the specific cell assigned to each SC; =, load, 
and -> are not real IL instructions or operations but conven- 
tional symbols used here for the purpose of explanation. We 
have omitted the calculation of the load addresses for the ar- 
rays since it is too lengthy to show in detail. In addition, the 
particle-particle and particle-cell interaction codes have been 
omitted because they simply compute the functions / and p in 



Eq. (Q]i. In Appendix A we present a working compute kernel 
written in C for OpenCL. We present a performance compari- 
son between the IL and OpenCL implementations in the next 
section. 



CPU 


Intel Xeon E5520 x 2 


Memory 


DDR3 800 1GB x 6 


GPU 


Radeon 5870 memory 1GB 


OS 


Ubuntu9.10(64bit) 


Driver 


Catalyst 10.8 (fglrx 8.76.7 [Aug 3 2010]) 


SDK 


ATI Stream SDK 2.2 



With the compute kernel shown, the flow of our tree method 
on the Cypress GPU is as follows. 

1. Construct a tree (host). 

2. Compute the total mass, the center of mass, and the effec- 
tive size of each cell (host). 

3. Compute the "next" and "more" pointers (host). 

4. Send the input data to the GPU (host). 

5. Iterative tree walk associated with the force calculation for 
each particle (GPU). 

6. Receive the force for each particle from the GPU (host). 

We have indicated whether the corresponding part is executed 
on the host or the GPU in bold text at the end of each step. 

5. Tests and optimization 

Here, we describe the results of some basic tests to show 
that our code worked correctly, and to obtain some performance 
characteristics. We used the configuration shown in TableQ]for 
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center of the cell 



Figure 7: Schematic view of a distant cell and a particle (shown by a solid purple point). The black solid points are particles that belong to the cell. The large red 
point is the center of mass of the particles in the cell. 



all results presented in this paper. In the basic tests, we used a 
set of particles randomly distributed in a sphere. 

First, Table |2] shows how the computing time depends on N. 
Each value of computing time was obtained by averaging the re- 
sults of 20 runs. In this test, we set 8 = 0.6. r tota i and r construct i on 
are the total time required for the force calculation and the time 
spent on the construction of the tree, respectively. Roughly, 
the tree construction took 20-28% of r tota i. For all values of 
N used, we checked that there was effectively no error in the 
force computed by the GPU. Q All operations on the GPU were 
done with single-precision, and we observed that the error was 
comparable to the machine epsilon, ~ 10~ 6 . We believe that 
the error originates from a difference in the implementations 
of the inverse of the square root on the host and on the GPU. 
We consider that this is not at all significant for our purpose of 
astrophysical many-body simulations. 

Regarding computing speed, randomly distributed particles 
are the most severe test because two successive particles in the 
input data have a very high chance of being at different posi- 



Table 2: Dependence of computing speed on N: no sorting 



N 


Ttotal (s) 


^construction 


(s) 


50K 


3.95 x 10~ 2 


1.1 x 10" 


-3 


100K 


9.65 x 10~ 2 


2.5 x 10" 


-2 


200K 


2.34 x 10- 1 


6.1 x 10" 


-2 


400K 


5.83 x 10-' 


1.5 x 10" 


-1 


800K 


1.36 x 10° 


3.8 x 10 


-1 



Here, the "error" is not the error due to the approximations in the force 
calculations. 
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Table 3: Dependence of computing speed on N: particles sorted in Morton 
order 



.declaration of I/O arrays and constants, 
.initialize variables for accumulation... 



N 


Ttotal (s) 


^construction 


Cs) 


50K 


3.00 x 10" 2 


9.1 x 10 


-3 


100K 


6.08 x 10~ 2 


1.8 x 10~ 


-2 


200K 


1.27 x 10- 1 


3.9 x 10- 


-2 


400K 


2.65 x 10- 1 


8.0 x 10" 


-2 


800K 


5.66 x 10 ' 


1.6 x lO- 


-1 



xi = load own->x 
yi = load own->y 
zi = load own->z 

cell = root 
whileloop 

break if cell is null 

xj = load cell->x 
yj = load cell->y 
zj = load cell->z 
mj = load cell->m 
s_eff = load cell->s_eff 

dx = xj - xi 

dy = yj - yi 

dz = zj - zi 

r2 = dx*dx + dy*dy + dz*dz 

if cell is a particle 

...compute particle-particle interaction, 
cell = load next 
else 

if r2 > s_eff 

. . .compute particle-cell interaction. . . 
cell = load cell->next 
else 

cell = load cell->more 
endif 
endloop 



Figure 8: Abstract IL code for our compute kernel that executes the iterative 
tree walk. 



tions. By the nature of the tree method, if two particles are 
close to each other, those particles are expected to be in the 
same cell and to interact with a similar list of particles and dis- 
tant cells. This means that if two successive particles in the 
input data are geometrically close, the tree walk for the sec- 
ond particle almost certainly takes less time owing to a higher 
cache-hit rate. To accomplish such a situation, we can sort the 
particles to ensure that successive particles are as close as pos- 
sible together. Fortunately, such sorting is easily available with 
the tree method by traversing the tree in depth-first order. In 
the course of the traversal, we add each particle encountered at 
a leaf node to a list. After the tree traversal, we can use the 
list obtained to shuffle the particles so that the order of parti- 
cles is nearly the desired order. This ordering of particles is 
called the Morton ordering. With this preprocessing, the speed 
of our method was altered as shown in Table [3] Note that the 
time in Table [3] does not contain the time required for the pre- 
processing. This is adequate, since in astrophysical many-body 
simulations, the tree structure is repeatedly constructed at each 
time step so that we can automatically obtain this sorting for 
free. We observed that r to tai obtained with the Morton ordering 
was faster by a factor of 1.5-2.2, depending on N, than without 
the preprocessing. Moreover, ^construction also decreased in all 
cases owing to better cache usage on the host. With the Morton 
ordering, the tree construction took roughly 14-27% of r tota i. 

The programming API for the Cypress GPU has a facility to 
report the cache-hit rate for the GPU. In Tabled we show how 
the cache-hit rate depends on N and the ordering of the parti- 
cles. The results indicate that the performance of our method 
is significantly affected by the ordering of the particles. In the 
tests described in the following, we always used preprocessing. 
Note that we could have obtained even better results if we had 
sorted the particles in the Peano-Hilbert order, which has been 
reported to be the optimal order for locality of data access , and 
is used by some tree codes (e.g JWarren & SalmonL ll993). 

In Figure [9] we present r tota i as a function of N for three 
cases: the tree method with Morton ordering, the tree method 
without sorting, and the brute force method. Except for N < 
30, 000, the tree method with Morton ordering (6 = 0.6) out- 
performs the brute force method on the GPU. 

In Figure [TUl we compare the performance for the follow- 
ing three cases: (a) a kernel written in IL running on a Cypress 
GPU, (b) a kernel written in C for OpenCL running on a Cy- 
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Table 4: Dependence of cache-hit rate on 
cles 


N tor different ordenngs ot the parti- 


1 


OpenCL GPU 

N log N 




N 


No sorting (%) 


Morton ordering (%) 






50K 


75 


93 






^ ■ 


100K 


63 


91 


§ 0.1 






200K 


55 


87 








400K 


48 


85 




******** 




800K 


43 


80 














0.01 







IL (no sort) ■ 
IL {Morton) 

IL (brute force) ■ 

N log N ■ 

N 2 



Figure 9: Comparison between the tree method on a GPU and the brute force 
method on a GPU. T toVd \ as a function of N is plotted for three cases. The W 2 
and N log N scaling lines are also plotted for reference. 



press GPU, and (c) a kernel written in C for OpenCL running 
on a multicore CPU. Since our test system had eight physical 
(16 logical) cores, the OpenCL kernel ran on the CPU with 16 
threads. The last two cases show almost identical performance 
even though the theoretical performance in single precision for 
the Cypress GPU is ~2 times faster. In fact, the tree method is 
not compute-intensive but is limited by memory bandwidth, and 
hence the effective performance of the compute kernel written 
in IL is roughly ~ 1 % of the theoretical performance in single- 
precision. On the other hand, the performance gap between the 
two kernels written in IL and C for OpenCL is a factor of ~2.5. 
We believe that one of the main reasons is that our compute 
kernel written in C for OpenCL is executed without using LI 
cache. We will investigate further optimizations of the OpenCL 
kernel in future work. 

Next, we examine how T^tai depends on 8, which controls the 
error bound for the tree method. A larger 8 means that more of 
the distant particles are replaced by a multipole expansion. In 
other words, for a smaller 8, we need to perform a larger number 
of force calculations, and hence the computation will take a 
longer time. At the same time, the error due to the multipole 
expansion decreases. Practically, a force calculation by the tree 
method with 8 < 0.1 is reduced to almost the same level as 
a brute-force computation. In such a regime, effectively, we 
do not have any preference for the tree method. In Table [5] we 



Figure 10: Comparison between the tree method using a kernel written in IL, 
the tree method using a kernel written in OpenCL on a GPU, and the tree 
method using a kernel written in OpenCL on a CPU. The N log N scaling line 
is also plotted for reference. 



Table 5: Dependence of 7" to tal and the cache-hit rate on 6 for N = 800K 



8 


^total (S) 


Cache-hit rate (%) 


0.2 


1.65 x 10 1 


23 


0.3 


5.24 x 10° 


36 


0.4 


2.24 x 10° 


48 


0.5 


1.14 x 10° 


59 


0.6 


6.82 x 10- 1 


68 


0.7 


4.92 x 10- 1 


75 


0.8 


3.98 x 10- 1 


80 


0.9 


3.43 x 10- 1 


80 


1.0 


3.10 x 10- 1 


82 



show the dependence of r tota i and the cache-hit rate on 8. In this 
test, we used N = 800K particles. In all of the tests that we have 
presented so far, a clear trend is that the computing time seems 
to be determined solely by the cache-hit rate. Before we had 
done these tests, we expected that branch operations would be 
a bottleneck for the compute kernel. In reality, all that matters 
is the cache-hit rate. 

Finally, we measured the average error in the accelerations 
using the following equation: 



N-l 



N ^ 



^directi 



(4) 



!=0 



where a tKe and a dllect are the acceleration forces obtained by 
the tree method on the GPU and by the brute force method on 
the host computer, respectively. We use a similar definition for 
the error in the potential, terror- For this test, we used a realis- 
tic particle distribution that represented a disk galaxy. We used 
GalactlCS (Kuiiken & Dubi nski , 1995 ) to generate the particle 
distribution. The particle distribution had three components, 
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namely a bulge, a disk, and a halo, with a mass ratio of approx- 
imately 1:2:12. Tables [6] and |7]present a e n-or and /? e n-or, respec- 
tively, for several different values of N and 6. Both a amr and 
p enol depend on 9 as expected. Except for N = 500K, we have 
a smaller a e nor for larger N. We found no systematic error in 
a,- computed by the tree code on the GPU. However, it would 
be desirable to use double-precision variables for accumulation 
of a -, to reduce a e n- i- for large N > 10 5 . There is only a negli- 
gible difference between the results computed by the compute 
kernels written in IL and in C for OpenCL. 

6. Comparison with other work 

6.1. Octree textures on a GPU 



Lefebvre et al.l (120051) have implemented an tree data struc- 



ture for a texture-mapping and tree traversal algorithm on a 
GPU. Owing to the limitations of GPUs and on the SDK for 
the GPUs at that time, their method seemed to be restricted to 
applications in computer graphics. A critical point is that the 
possible depth of the tree was limited, so that we cannot di- 
rectly employ this implementation for our purposes. 

6.2. Another tree implementation on a GPU 

Gabur ov" et al.l ( 2010h have reported another implementation 



of the tree method on a GPU. Our implementation and their 
approach share the same strategy, but there are differences in 
detail, aside from the GPU architecture adopted. Both of us 
ha ve implemented a tree walk on a GPU. The implementation 
of Gaburov et al. ( 2010l) constructs interaction lists by means 
of a tree walk on a GPU and then computes the force on the 
GPU using these interaction lists. This implementation requires 
three invocations of kernels. In contrast, we do a tree walk and 
compute the force on-the-fiy with one kernel invo cation 



Both approaches have pros and cons. With the lGaburov et al 
(1201 Oh approach, a fairly high compute efficiency (~ 100 
Gfiops) has been obtained, whereas our code shows low ef- 
ficienc y (~ 30 Gfiops). On the other hand, iGaburov et al 
feOlOh 's code requires more floating-point operations than does 
our optimal tree code. We believe t h at our implementation 
is simpler than that of Gaburov et al. ( 2010h . which requires 
multi-pass computations. And we also believe that our im- 
plementation is easier to extend to a general tree walk. In 
fact, we have e xtended our compute kernel written in I L to 
the SPH method dGingold & Monaghanl[l977HLucvill977b and 
obtained fairly good performance. 

6.3. Fast multipole method on a GPU 



Gume rov & Du raiswami (2008) have reported an implemen- 



tation of a fast multipole method (FMM) on a GPU. The FMM 
is a sophisticated algorithm to evaluate long-range interaction 
forces with a computational complexity of O(N). In the FMM, 
in addition to the replacement of distant particles with multi- 
pole expansions, local expansions ar e utilized to evaluate the 
force acting on a group of particles. iGumerov & Duraiswamil 
d2008l) reported that for N = 1,048,576, the algorithm too k 
0.68 s with p = 4 (Table 8 of IGumerov & Duraiswamil d2008h ). 



where p is a parameter that controls the error bound. Figure 
10 of IGumerov & Duraiswam i (2008) indicates that the aver- 
age relative error obtained with p — 4 was ~2xl0~ 4 for the 
potential. The error is comparable to the relative error obtained 
with the tree method with 6 ~ 0.5-0.6. Note that Gumerov and 
Duraiswami used a random particle distribution in a cube. For 
comparison, we did a test with a similar particle distribution. 
Our kernel written in IL took 0.65 s for N = 1, 048, 576 with 
= 0.6. The cache-hit rate of the test was 81%. The perfor- 
mance of our tree code and that obtained by Gumerov and Du- 
raiswami with the FMM is comparable. Note that the Cypress 
GPU used in the present work was different from and newer 
than the GPU that Gumerov and Duraiswami used. Generally, 
the FMM is well suited to applications that require long-range 
interaction forces for uniformly distributed particles or sources, 
whereas the tree method is more robust to the highly clustered 
particles that typically arise in astrophysical many-body simu- 
lations. We believe that our method is more suitable than that of 



Gumerov & Durais wami (2008) for our purpose of astrophysi- 
cal applications. 

7. Conclusions 

In this paper, we have described our implementation of the 
tree method on a Cypress GPU. By transforming a recursive 
tree traversal into an iterative procedure, we have shown that 
the execution of a tree traversal together with a force calcula- 
tion on a GPU can be practical and efficient. In addition, our 
implementation shows performance comparable to that of a re- 
cently reported FMM code on a GPU. 

We can expect to get further performance gains by fully uti- 
lizing the four-vector SIMD operations of the SCs of the GPU. 
Moreover, since 10-20% of r to t a i is spent on the tree construc- 
tion, parallelization of this part on a multicore CPU will be an 
effective way to boost the total performance. Provided that we 
can easily extend our code to implement a force calculation for 
short-range interactions by a method such as the SPH method, 
we believe that a future extended version of our code will en- 
able us to do a realistic astrophysical simulation that involves 
baryon physics with N > 1, 000, 000 very rapidly. It is fairly 
easy to incorporate higher-order multipole expansion terms into 
our method, and it would be a natural extension to the present 
work. Another good application of the tree method on a GPU 
would be to simulations that ado pt a symmetrized Plummer po- 
tential dSaitoh & Makinol l2010h . We believe that our method 
is the best for implementing that proposal, and hence that we 
shall certainly obtain better accuracy and good performance in 
simulating galaxy evolution and formation with different mass 
resolutions. 
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Table 6: Dependence of the average acceleration error a enoI on N and 8. N is a multiple of 1024. 



e 


N = 10K 


7Y = 50K 


N = 100K 


N = 200K 
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2.93e-04 
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Table 7: Dependence of the average potential error p amI on N and 9. 
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Appendix A. Working OpenCL code 

In this appendix, we present our compute kernel written in C 
for OpenCL. We have tested this kernel with ATI Stream SDK 
2.2. The function tree_gm is an entry point to the kernel, n is 
the number of particles, pos [i] is a f loat4 variable for the 
positions and masses, i.e., and m,-. acc_g[i] is a float4 
variable for the accelerations and the potential, i.e., a, and pj, 
next [] and more [] are arrays that contains the two pointers 
described in Section l4~4l For all arrays, we assume that the data 
for the particles is at indices ; from to n - 1. The data for 
the cells resides at i >= n. Finally, size [] contains softening 
parameters of the particles for i < n and the size of the cells 
for i >- n. We believe that it will be straightforward to prepare 
these arrays from any tree implementation. 
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float4 g(float4 dx, float r2, float mj) 
{ 

float rli = native_rsqrt (r2) ; 
float r2i = rli*rli; 
float rlim = mj*rli; 
float r3im = rlim*r2i; 
float4 f ; 

f .x = dx.x*r3im; 
f .y = dx.y*r3im; 
f .z = dx.z*r3im; 
f .w = rlim; 

return f ; 

} 

kernel void 

tree_gm( global float4 *pos, 

global float *size, 

global int *next , 

global int *more, 

global float4 *acc_g, 

int root , int n) 

{ 

unsigned int gid = get_global_id(0) ; 
float4 p = pos[gid] ; 

float4 acc = (f loat4) (0 . Of , . Of , O.Of, O.Of); 

int cur = root ; 
while (cur != -1) { 

float4 q = pos [cur] ; 

float mj = q.w; 

float s = size [cur]; 

float4 dx = q - p; 

float r2 = dx.x*dx.x + dx.y*dx.y + dx.z*dx.z; 

if (cur < n) {. 

if (r2 != O.Of) { 
r2 += s; 

acc += g(dx, r2, mj); 

} 

cur = next [cur] ; 
}■ else { 

if (s < r2) { 

acc += g(dx, r2, mj); 

cur = next [cur] ; 
} else { 

cur = more [cur] ; 

} 

} 

} 

acc_g[gid] = acc; 

} 



Figure A. 1 1 : Working OpenCL code that that executes the iterative treewalk. The compute kernel that we have written in IL is mostly similar to this code. 
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